ABSTRACT



The necessity for a more complete understanding
of language as the basis for machine translation
and computational linguistics is stressed. Other
benefits which will result from this long-term
research -- including information retrieval and
automatic classification -- are also mentioned.

FOREWORD



This paper is based on a lecture given at 
The University of Texas Science Conference, November 20,
1964. The conference provided a means for scientists
in The University of Texas faculties to get acquainted 
with one another and to listen to brief expositions 
of research in progress on the campus. Part of the 
research mentioned here was performed under grant 
NSF GN- 5 O 80

INTRODUCTION



This paper deals with a relatively new approach 
to the study of language, one underlying the work of 
the Linguistics Research Center. This view regards
language in such a way that it can be manipulated
with a computer. Yet the view cannot be related to
technological developments, for it preceded the computer. 
The resultant approach to language has often been 
called structural linguistics. 

Language may be studied from many points of 
view. One may wish to acquire a graceful mastery of 
one or more languages, either for writing special 
kinds of texts called poems, or simply to impress as 
well as inform an audience. One may wish to learn 
about the history of specific languages, how English 
is related to Hindi, Greek, Armenian or Irish. The 
most prominent interest in language in Western culture 
arose from a desire to understand venerated texts, 
primarily texts in Hebrew, Greek, Latin, the Bible 
and the classics. The understanding of these texts 
led to the development of special techniques and 
attitudes about language. For us, oddly enough in 
contrast with the Greeks and Romans, the written 
language has seemed more fundamental than the spoken, 
and we have spent more time learning to read than to 
speak languages, whether French, German, Russian, or 
the cited classical languages. Further, since in our 
day written materials are broken up into units called 
words, these seem to us the fundamental entities of 
language.

Moreover, since these languages, especially
Latin, seem somehow to be model languages, we have
sought a mastery of current languages, including our
own, through descriptions--grammars--which are modeled
on the grammar of Latin. To master a language,
including English, our grammarians, teachers and students
note its resemblances to Latin and also the differences
from it. This procedure may be compared with that of a
geographer who adopts one location, for example New
York City, as the ideal and describes all other locations
by their resemblances to it. From such a geographer
we would not get a map of London, Paris or Moscow, but
rather various maps of New York modified in accordance
with deviations from New York in these cities.

It is not my aim to present a critique of
any view of language, or of our methods of teaching
languages, or even of any type of research on language.
But since we have all studied languages in accordance
with the Latin-based approach, we regard any language
in accordance with the views given us in our schools.
These views must therefore be specified if we are to
understand one another. I might also mention that the
first attempts at computer processing of language
failed because the scholars concerned viewed the
essential problem as the manipulation of words.

Besides discussing a somewhat different
approach to language I will touch on the linguistic
investigations it has prompted and is continuing to
require. I will also deal briefly with the
requirements this approach is making on computer programming
and may make on computer technology.

FORMAL STRUCTURE



Possibly the most important feature of
structural linguistics is the understanding that language
has a formal structure, composed of various
sub-structures. In these structures the function or value of
any entity is determined largely by its relationship
with other entities. Entities then are not defined
by their relationship to the outside world; a noun
for example is not defined as the name of a person,
place or thing. Defining entities by reference to the
outside world seems to a linguist somewhat similar to
presenting mathematics through concrete objects, never
to add 2 and 2, but always two apples to two apples,
and so on. There is little doubt that our understanding
of numbers, our progress in mathematics, would have been
hampered if we had dealt with them only in connection
with the outside world rather than as abstract signs.
Linguists hold that such a view of language has impeded
our understanding of it.

Some decades ago a few scholars began to
examine language as a system of signs whose function
was specified by their interrelationships. This
approach to language--this theory, if you wish--
would define a noun in any given language by its
relationship to other entities in that language. A
noun in English, for example, might be defined as an
entity with certain relationships to inflectional
elements, to a Z-like entity in the plural: arm : arms,
or in the possessive: man : man's. Other
languages might not have nominal inflection and
accordingly would not have a class of nouns. In Japanese,
for example, only verbs are inflected; we cannot
then speak of a class of inflected nouns. Another
basis of definition of entities might be by their
relationship to independent entities, for example
articles: the, a, or to verbs. By such a definition
man is a noun because it can follow the; went, on the
other hand, is not. Using such a procedure, we can
identify nouns in Japanese. Further, we can also
identify larger acceptable entities, such as sentences,
by their entities and the interrelationships of these;
men talk, for example, is an acceptable English
sentence, but not men happily, or even men happy.
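
The distributional definition just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the choice of "follows an article" as the single test, and the name follows_article are invented here and are not taken from the work of the Center.

```python
# Toy corpus of attested sequences; invented for illustration only.
CORPUS = [
    "the man went home",
    "the men talk",
    "a man talks",
    "the arm moved",
    "men went home",
]

# The independent entities used as a test frame (assumption: articles only).
ARTICLES = {"the", "a"}


def follows_article(corpus):
    """Return the set of words attested directly after an article."""
    found = set()
    for sentence in corpus:
        words = sentence.split()
        # Walk over adjacent pairs (previous word, current word).
        for previous, current in zip(words, words[1:]):
            if previous in ARTICLES:
                found.add(current)
    return found


noun_like = follows_article(CORPUS)
print(sorted(noun_like))    # ['arm', 'man', 'men']
print("went" in noun_like)  # False
```

In this sample, man, men and arm fall into one class because they occur after the or a, while went does not. A single frame is of course far too weak for a real description; a fuller account would combine many such relationships.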

When this approach to language was pursued, the
work of linguistics came to be looked on as the
determination of the entities of any language and their
interrelationships.

Two requirements must be met before one
can deal usefully with language in this way. We must
first determine whether the materials are genuine,
whether for example an English speaker permits the
sequence: men talk. Next, we ascertain whether the
entities in this sequence have a characteristic meaning.
In such determination we elicit comparable sequences,
e.g. men walk, then talk, and so on. With such
contrasting sequences we would satisfy ourselves that m
is a characteristic marker, for it is the only entity
distinguishing men talk from then talk, or
distinguishing man talks (a statement an anthropologist
might make) from Ann talks (a statement one might make
about a very young lady). Similarly, t is a
characteristic sound marker, distinguishing ten from men,
talk from walk, and so on.
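
The contrast test described above, in which two sequences differ in exactly one entity of sound, can be sketched as follows. The transcriptions are rough and hypothetical (for instance "dh" stands for the first consonant of then), and the function name minimal_pairs is an assumption made for this sketch.

```python
from itertools import combinations

# Rough, hypothetical transcriptions: each word is a tuple of sound units.
WORDS = {
    "men": ("m", "e", "n"),
    "then": ("dh", "e", "n"),
    "ten": ("t", "e", "n"),
    "talk": ("t", "o", "k"),
    "walk": ("w", "o", "k"),
}


def minimal_pairs(words):
    """Yield (word1, word2, unit1, unit2) for words differing in one unit."""
    for (w1, s1), (w2, s2) in combinations(sorted(words.items()), 2):
        if len(s1) != len(s2):
            continue
        # Collect the positions at which the two words disagree.
        diffs = [(a, b) for a, b in zip(s1, s2) if a != b]
        if len(diffs) == 1:
            yield (w1, w2, diffs[0][0], diffs[0][1])


pairs = list(minimal_pairs(WORDS))
for pair in pairs:
    print(pair)
```

Each pair reported is evidence that the two differing units are distinct markers in the language; men : ten supports m and t, talk : walk supports t and w.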
In addition to determining the characteristic
entities of sound in a language, we may also determine
larger entities, for example, talk as opposed to walk.
These differ from entities like the first consonants
of men, ten and then in that they have established
relationships to certain concepts. Briefly, we say
they have meaning. The first consonants of men, ten
and then do not. They serve to distinguish meanings,
but we cannot associate with them any given concepts,
such as 'animateness', 'number' or 'temporality'.

Rigorous techniques for determining entities 
of both kinds have been developed. 

When such entities are specified in a given
language, linguists set out to determine their role--
one might say, their properties. In English there
are about forty entities like m and t. These may be
regarded as signs, comparable to other kinds of signs
man uses, e.g. 3, 4. Just as a mathematician might
investigate relationships between such units in a
given number system, setting up various classes, e.g.
primes, so a linguist might investigate the role of
such entities in a given language. He might determine
what relationships t has with regard to the other
entities of sound. In English, for example, t may
precede e if n follows, but not e alone; there is no
English sequence te. Nor are there sequences like
tne, etn, and so on. Other such problems will occur
to any linguist, mathematician, or to anyone who
enjoys manipulating signs. Yet few such problems have
been investigated, even in a widely used language like
English, to say nothing of 5,000 other languages. We
have had neither the personnel nor the physical resources.
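
The question of which orderings of a few sound entities actually occur can be posed mechanically. The sketch below is an illustration only: it uses ordinary spelling as a stand-in for entities of sound, and the small lexicon and the name attested_orders are assumptions. "Unattested" here means no more than "absent from this sample".

```python
from itertools import permutations

# Tiny sample lexicon; spelling stands in for sound units (assumption).
LEXICON = {"ten", "net", "men", "hen", "pen"}


def attested_orders(units, lexicon):
    """Map every ordering of the given units to whether it is in the lexicon."""
    orders = {"".join(p) for p in permutations(units)}
    return {order: order in lexicon for order in sorted(orders)}


table = attested_orders("ten", LEXICON)
for order, found in table.items():
    print(order, "attested" if found else "not attested")
```

Of the six possible orderings of t, e and n, only ten and net are found in the sample; tne and etn are not. Extending such a survey to all forty entities and to longer sequences is the kind of investigation for which resources have been lacking.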
A similar range of problems might be cited
in the investigation of entities like walk, talk, take,
brake and so on. We might find sequences in which man :
men precede any of these, e.g. the man walked, the man
braked around curves, etc. But we do not find sequences
like walkman, talkman, takeman paralleling brakeman.
A complete description of any language would specify
which of such sequences occur.

At this stage of his investigations a linguist
does not deal with meaning. He has determined that
brake differs from take, that it has a meaning; but in
examining possible sequences like brakeman he deals
only with its properties of occurrence. Nonetheless
this second type of investigation, noting the
interrelationships between entities like take, brake,
man, -ing, provides even more problems than does the
first.

Still other entities must be identified in 
language and investigated similarly. But the two types 
of entities I have selected may exemplify the approach 
of structural linguistics.

Those structural linguists who concern them- 
selves exclusively with the study of sets of linguistic 
entities and their interrelationships are sometimes 
called mathematical or computational linguists. Other 
linguists may deal with other language problems--the 
pronunciation of talk , walk in various areas, the 
stylistic differences between talk and speak , and 
so on. But a computational linguist limits his concern 
to sets of entities and their interrelationships. 



If this approach to language is valid, in
using language men acquire a number of entities and
learn how to manipulate them in relation to one
another. Further, if a machine could be devised
which would store the number of entities stored by
man, with rules specifying their relationships to
other entities, the machine might simulate man's
mastery of language.

SIMULATION



As is well-known, about twenty years ago a
machine was developed which seemed to have the
essential capabilities, the computer. Possibly the
manipulation of language would never have engaged the
attention of computer specialists if the problem of
rapid intercommunication had not become so prominent.
To be sure, computation centers might have found it
amusing in time to have a few language games available
for visitors, when they became bored with tic-tac-toe,
checkers, chess or go. But since the scientists, who
were nursing along the infant computer and contemplating
uses for it when it matured, had just been involved in
international struggles which pointed up the importance
of reading the scientific publications of the other
side, they suggested that the computer might solve the
problem of intercommunication. The computer therefore
was looked on as the machine to take over the
uninspiring activity of translation; supporting agencies
provided time on computers and a small amount of money
to research workers, whose goal was to be machine
translation. This seemingly overriding goal was the prime
activity for which language specialists might use
computers. To the outside world today, still, linguists
doing research with computers are working on machine
translation.

With a million words a day of important 
materials awaiting translation from Russian to English, 
let alone materials of secondary importance or materials 
in Chinese, Japanese, German, French and so on, machine 
translation would be a fine accomplishment. But the
problem requires a bit of preliminary work. We may view
the essential requirement as one of synthesizing sentences.
This activity may be compared to synthesizing protein
molecules--though nothing like the time and money
spent on chemistry has been applied to linguistic
investigation.

One of the first problems we may note is that
language is not a simple linear structure; rather, it
consists of numerous structures. One is made up of
entities like t, m and so on, which might be compared
with atoms; this structure contains relatively few
entities, but their rules of interrelationship are
complex. A second structure is made up of entities like
then, men, walk, which might be compared with radicals;
this structure contains a great number of entities,
possibly with somewhat less complex rules of
interrelationship. From these the smallest free form of
language is constructed, the sentence. In making
sentences, in using language, man has somehow learned to
master both of these structures. More, he has learned
the relationships of the entities in the second
structure to a totally different structure, that of concepts.
Since computer manipulation of language is a type of
simulation, before we can use computers effectively for
managing language, we must understand how these various
structures relate to one another, how language functions.

LANGUAGE DATA PROCESSING



Of the various problems, some are
straightforward, for example, the amassing of entities and their
rules. We spend the first ten years of our lives
acquiring control over one language, continue to add to
our stock of entities, and rarely achieve mastery over
a second language. To give the computer similar
opportunities we must have large-scale programs, by
means of which we can store genuine materials and
materials with a characteristic meaning. A great deal
of effort has been expended by members of the Linguistics
Research Center over the past five years to develop the
system of programs which handle the data of language and
their interrelationships.

Man has taken care of this problem very
cleverly. He reduces language to a set of entities of
sound, about forty in a language, and accordingly has
relatively few building blocks to control.
Unfortunately no machine has been devised to match man's
discriminatory powers in managing the entities of speech.
Accordingly at present, machine manipulation of language
must be based on the second level with its tremendous
number of entities.

Since our work in the Center is still experi- 
mental, it is difficult to forecast how many such 
entities must be stored in a computer. Some estimates 
put the number of chemical terms in German at two 
million; the rules for relating them to other entities 
of the language will obviously be fewer. 

In the relatively small computers available today
the rules indicating interrelationships and vocabulary
items of a language must be stored on magnetic tape.
The programming system developed at the Linguistics
Research Center has been successful in analyzing
materials of limited vocabulary and syntactic
complexity, using about 50,000 rules and items in
each language.

Until one has dealt with the highly rigorous
computer it is almost impossible to visualize the
problems involved in a thorough analysis of language. A
simple example from a physics textbook may illustrate
some of them:



Loudness is the property of sound
determined by the effect of the power
of the sound waves on our ears.



Let us suppose that we write a computer routine--a
syntactic rule--relating of to a following noun, as in
of sound; the rule will not then handle the sequence of
the power, for here of is followed by the definite
article. If we modify our rule accordingly, we still
have not handled the use of of in of the sound waves,
for here the article is followed by a noun used as
adjective. In putting a sentence like this into a
computer, we must therefore provide for sequences
of of and a variety of entities. Obviously our rules
cannot be simple, though our example may have been.
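
The successive revisions of the rule for of can be illustrated with rules held as data and a single general matching routine. This is a minimal sketch under assumptions: the category labels, the three rule sets and the function parse_of_phrase are invented for the illustration and do not reproduce the rule format of any actual system.

```python
# Hypothetical word categories for the words of the example sentence.
CATEGORY = {
    "of": "PREP", "the": "ART",
    "sound": "NOUN", "power": "NOUN", "waves": "NOUN",
}

# Each rule is a sequence of categories that may follow the preposition.
RULES_V1 = [("NOUN",)]                            # of sound
RULES_V2 = RULES_V1 + [("ART", "NOUN")]           # of the power
RULES_V3 = RULES_V2 + [("ART", "NOUN", "NOUN")]   # noun used as adjective


def parse_of_phrase(phrase, rules):
    """Return True if the phrase is 'of' followed by one rule's categories."""
    words = phrase.split()
    if not words or words[0] != "of":
        return False
    # Unknown words get the category None and therefore match no rule.
    tags = tuple(CATEGORY.get(word) for word in words[1:])
    return tags in rules


for rules in (RULES_V1, RULES_V2, RULES_V3):
    print([parse_of_phrase(p, rules)
           for p in ("of sound", "of the power", "of the sound waves")])
```

Only the rule list grows from one version to the next; the matching routine is never rewritten. Each added rule covers one more of the three phrases in the example sentence.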

Another entity of the sentence, on, may
illustrate a further type of problem. If we relate
on to the surrounding entities, we arrive at the
possible sequence: the sound waves on our ears. This
sequence might compare to that of the flag waves on the
flag-pole or the policeman waves on the traffic. But
such relationships, which would appear identical to a
computer, trouble us; the sentence from our physics
textbook seems absurd to us if on is related to waves
in either of these ways. We have learned that on is
related to effect and that the meaningful sequence is
effect on our ears. We scarcely need to discuss the
inadequate translations that would be produced if such
sentences were put word-for-word into German, Russian
or other languages.
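
The difference between relating on to its neighbor waves and relating it to effect can be made concrete. The following is a sketch under assumptions: the small table of governing words and the function attach are hypothetical, and a real description would need far richer information than a single lookup.

```python
# Hypothetical lexicon: words known to govern a given preposition.
GOVERNS = {"effect": {"on"}, "property": {"of"}}

SENTENCE = ("loudness is the property of sound determined by the effect "
            "of the power of the sound waves on our ears").split()


def attach(preposition_index, words, governs):
    """Return the word that the preposition at this index is related to."""
    preposition = words[preposition_index]
    # Scan leftward for the nearest word recorded as governing it.
    for candidate in reversed(words[:preposition_index]):
        if preposition in governs.get(candidate, set()):
            return candidate
    # No governing word is known: fall back to the adjacent word.
    return words[preposition_index - 1]


on_index = SENTENCE.index("on")
head_naive = attach(on_index, SENTENCE, {})
head_informed = attach(on_index, SENTENCE, GOVERNS)
print(head_naive)     # waves
print(head_informed)  # effect
```

Without lexical information the routine produces the absurd reading waves on our ears; with the entry for effect it recovers effect on our ears. The point is only that the needed knowledge has to be stated explicitly somewhere.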

Since an English speaker understands his
language in this way, a computer must also be prepared
to manipulate it accordingly. To arrange such
manipulations we must describe English far more precisely
than has ever been done before. The required detail of
description has never been provided before because
native speakers master such sequences, and we are
charitable to foreigners who learn inadequate English
from our inadequate grammars. But if a computer makes
any requirement, it is for precision. A computer would
not be happy with our simple sentence until it knows
what to do with every entity, including on.
Consequently a linguist has to determine the role of an
entity first of all, then describe it. Since even the
large dictionaries which have been produced for
English, German and the other widely studied languages
have not described these languages adequately,
linguists in the Linguistics Research Center are now at
work producing such descriptions--writing rules for
English, German, Russian and other languages. Figures
1-5 illustrate the procedures involved in making
a syntactic analysis of an English sentence, in
accordance with a grammar written by Dr. Wayne Tosh
of our Center.

The resultant rules are many and intricate.
When produced, they must be handled by the computer,
but kept independent of it through use of generalized
computer programming. If, for example, a specialized
program were written for handling combinations of
prepositions plus nouns or prepositions plus articles
plus nouns, it would have to be revised to handle
sequences of prepositions plus articles plus
noun-adjectives plus nouns, as in of the sound waves. The
Linguistics Research System, produced under the direction
of Eugene Pendergraft, was devised to meet this
requirement of generalization. With this system linguistic
rules, independent of specialized computer programs,
may be produced to handle phrases of various length--
preposition plus noun as in of sound, preposition plus
article plus noun, as in of the power, and longer phrases
such as of the sound waves. Other instructions alert the
computer to watch out for prepositions like on after
a noun like effect. A chart of the system indicates
the demands placed on computer manipulation of language
and also one of the results of five years of work,
supported by the US Army Electronics Laboratories and
by the National Science Foundation.

One of our practical problems is to achieve
an understanding by outsiders of the use of computers
in processing linguistic material. Most of us had
our notions about scientific procedures determined by
elementary science class, in which we probably used a
Bunsen burner. This early activity seems to leave the
indelible impression that scientific equipment, for
example a computer, is like a Bunsen burner. There is
little variety of use for a Bunsen burner--it merely
heats things. The heat isn't different if one lights
it with a flint or a match--if one strikes the match
on a piece of sandpaper or one's thumbnail. By
analogy it is assumed that the machine is the essential
part of computation; after one switches on the power,
a computer can cook your data as well as mine. Yet
in language, as in the social sciences, the important
part of computation is the program. The importance
of how one utilizes a machine rather than the makeup
of the machine may be one of the essential differences
between work in the social sciences and that in the
natural sciences. Possibly software and hardware
sciences would be more appropriate names.

AUTOMATIC CLASSIFICATION



Yet even a system with programs of the
complexity of those illustrated is inadequate for handling
language. In a sentence of fewer than twenty elements,
for example, there are more than a million
possibilities of analysis. But this figure, large though it is,
fails to take into account an analysis for meaning, for
determining among other things that in our sentence
sound is similar in meaning to noise rather than to
healthy, valid, as in a sound mind or a sound theory.
When we handle the multitude of entities necessary in
analysis of meaning we will deal with many more
possibilities of interpretation than are found for on.
In managing these, our present computers would be choked.
Even the larger computers now becoming available would
deal with the quantities of data slowly. Adequate
speed seems possible only through refinements of computer
theory and improved techniques of classification.
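
The text does not say how the figure of a million is reached. One way to see that it is plausible, offered here purely as an assumption, is to count the ways a sequence of n elements can be grouped into nested pairs (binary bracketings). That count is the Catalan number C(n-1). The function name bracketings is invented for this sketch.

```python
from math import comb


def bracketings(n):
    """Number of distinct binary bracketings of a sequence of n elements."""
    if n < 1:
        raise ValueError("need at least one element")
    k = n - 1
    # Catalan number C(k) = (2k choose k) / (k + 1); the division is exact.
    return comb(2 * k, k) // (k + 1)


for n in (5, 10, 14, 15, 19):
    print(n, bracketings(n))
```

Under this way of counting, 14 elements allow 742,900 groupings and 15 elements allow 2,674,440, so a million is passed well before twenty elements. The count ignores category labels and meaning, which would multiply the possibilities further.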

A few years ago R. M. Needham, of Cambridge
University, pointed the way to such classification
with his clumping theory. His procedures are being
expanded for application to larger sets of data by
A. G. and N. Dale of our Center. Details are
provided in the paper, A Programming System for
Automatic Classification with Applications in Linguistic
and Information Retrieval Research, LRC 64 WTM-4,
written by A. G. and N. Dale and E. D. Pendergraft.
With other papers, this is available from the Center.
Even the procedures described in this paper require
a great deal of computer time for handling a relatively
small number of entities. Further research is being
pursued to improve and speed up the procedure. I have
time merely to mention such research, but would also
like to point out that it was not even envisaged
before language analysis with computers was undertaken.
The amount of linguistic data which must be manipulated,
as well as its complexity, has pointed up the need for
research in fields of applied logic or mathematics that
would not have been related to language investigation
a few years ago. Students in the sciences, for whom
the required language courses may seem to have little
lasting value, might well consider applying themselves
to these problems. Solutions will follow only from a
quantitative approach, generally lacking in previous
students of language.
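
The classification procedures themselves are not described here. The following toy sketch illustrates only the general idea of grouping entities that are more similar to one another than to the rest; it is not the procedure of Needham or of the cited paper. The similarity scores, the membership criterion and the name refine_clump are all assumptions.

```python
# Invented symmetric similarity scores between five entities.
SIM = {
    "talk":  {"walk": 3, "speak": 4, "man": 0, "arm": 0},
    "walk":  {"talk": 3, "speak": 2, "man": 1, "arm": 0},
    "speak": {"talk": 4, "walk": 2, "man": 0, "arm": 0},
    "man":   {"talk": 0, "walk": 1, "speak": 0, "arm": 3},
    "arm":   {"talk": 0, "walk": 0, "speak": 0, "man": 3},
}


def refine_clump(seed, similarity, max_rounds=100):
    """Adjust a seed group until each item sits on its more similar side."""
    items = set(similarity)
    clump = set(seed)
    for _ in range(max_rounds):
        changed = False
        for item in sorted(items):
            inside = sum(similarity[item][o] for o in clump if o != item)
            outside = sum(similarity[item][o]
                          for o in items - clump if o != item)
            should_be_in = inside > outside
            if should_be_in != (item in clump):
                if should_be_in:
                    clump.add(item)
                else:
                    clump.discard(item)
                changed = True
        if not changed:
            break
    return clump


print(sorted(refine_clump({"talk", "speak"}, SIM)))
print(sorted(refine_clump({"arm", "man"}, SIM)))
```

With these scores the first seed grows to take in walk, and the second stays as it is. The outcome of such a procedure depends on the seed and on the order in which items are examined, and every pass compares each item with every other, which suggests why computer time grows quickly with the number of entities.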
ANALYSIS OF MEANING



But though we face numerous problems in the
development of computer systems and in the theoretical
work which must be carried out before systems and
computers can manage efficiently the huge and complex
amounts of data, our largest problems remain in the
understanding of language. Chief among these is the
treatment of meaning. In dealing with meaning we are
probably a bit farther along than Plato, though not
much. Our dictionaries largely side-step the
problem; they set out to provide synonyms, whether
monolingually or bilingually. Since they are fairly
effective tools, we can handle translation of a sort without
understanding meaning. But for competent translation,
for automatic indexing and abstracting, for problems
in artificial intelligence, we will have to control
meaning as we now do syntactic relationships.

Our theoretical approach is clear. We assume
that language is structured at the level of meaning
similarly to its structure at the levels of sound
and syntax. Again, we do not relate entities to the
outside world, but to concepts. Still the problems
of analysis are staggering. The sheer magnitude of
the data--all human knowledge--is troublesome enough.
But how to classify it? By specialties as we do in real
life? Should one computer handle nuclear physics,
another the physics of light, another molecular biology,
and so on? (If we did, we would not welcome a physicist
who also concerns himself with biology.) But if we
divide the universe of concepts in this way, what type
of hierarchical arrangement should we use? If, for
example, we define man as 'male human being', should
we distinguish between the concepts 'male' and 'human
being' because 'male' is automatically supplied in such
sequences as 'he was a man who..., the king is a man
who...'? It will be difficult to answer such questions
until we carry on a fair bit of investigation.
Before then, it will even be difficult to pose the proper
questions.

ACCOMPLISHMENTS



It may be disappointing for non-linguists
to hear that linguistic work has scarcely begun, with
or without computers. But we have some accomplishments.
Some theoretical positions seem supportable. We are
on our way to an extensive and flexible linguistic
research system, and expect to have adequate computers
to make use of it. The traditionally lone linguist
is beginning to work with specialists in related fields.
Even the achievement of analyzing language syntactically
may seem small. But our tools are still inadequate.
Given satisfactory scanning devices and more powerful
computers we will be able to use our system for
analyzing more than a snatch of language. Already
straightforward linguistic applications may be carried out,
if adequate resources are provided; any book may be
automatically indexed, and accordingly among other things
more readily proof-read. Bibliographical and other
data may be managed automatically; in a pilot project,
the Center has listed all Slavic books in the University
Library, so that anyone interested in Tolstoy, in
Russian novels or the like, may be given an immediate
print-out of the titles. Other such projects need only
financial support for achievement. The chief aim of
the Center, however, is to continue theoretical
investigations of language and data processing techniques, and
the preparation of computer programs, so that ultimately
a computer will be able to manipulate language with
somewhat the same proficiency as does man.

THE UNIVERSITY OF TEXAS
LINGUISTICS RESEARCH SYSTEM 




Circles represent magnetic 
data tapes. Boxes represent 
programs; those with heavy 
lines are scheduled for 
completion by the end of 
this year. 




